Part I: Summarizing Multiple Variables (using 2020 Michigan exit poll data)

Our tools depend on the type of variables we are trying to graph.

Returning to the “gender” gap

Conditional variation involves examining how the values of two or more variables are related to one another. Earlier we made these comparisons by creating different tibbles and then comparing across them, but we can also make comparisons without creating multiple tibbles. Start by loading the Michigan 2020 Exit Poll data.

library(tidyverse)
library(scales)
mi_ep <- read_rds("https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/MI2020_ExitPoll_small.rds")
MI_final_small <- mi_ep %>%
  filter(preschoice=="Donald Trump, the Republican" | preschoice=="Joe Biden, the Democrat") %>%
  mutate(BidenVoter=ifelse(preschoice=="Joe Biden, the Democrat",1,0),
         TrumpVoter=ifelse(BidenVoter==1,0,1),
         AGE10=ifelse(AGE10==99,NA,AGE10))

We learned that if we count using multiple variables, R will count within values. Can we use this to analyze how support varies by group? Let’s see!

MI_final_small %>%
  filter(AGE10==1) %>%
  count(preschoice,SEX) %>%
  mutate(PctSupport = n/sum(n),
         PctSupport = round(PctSupport, digits=2))
## # A tibble: 4 × 4
##   preschoice                     SEX     n PctSupport
##   <chr>                        <dbl> <int>      <dbl>
## 1 Donald Trump, the Republican     1     7       0.22
## 2 Donald Trump, the Republican     2     1       0.03
## 3 Joe Biden, the Democrat          1     9       0.28
## 4 Joe Biden, the Democrat          2    15       0.47

Here we have broken everything out by both preschoice and SEX, but PctSupport is not quite what we want: it is the fraction of all responses (summing to 1 across the rows) rather than the proportion of support for each candidate within each sex.

To correct this and perform the calculations within each group we need to use the group_by function.

We can use the group_by command to organize our data a bit better. What group_by does is to run all subsequent code separately according to the defined group.

So instead of running a count or summarize separately for Males and Females as we did above, we can group_by the variable SEX.chr (or FEMALE or SEX – it makes no difference, as they are all equivalent) and then perform the subsequent commands. So here we are going to filter to select those who are 24 and below, and then count the number of Biden and Trump supporters within each value of SEX.

MI_final_small %>%
  filter(AGE10==1) %>%
  group_by(SEX) %>%
  count(preschoice)
## # A tibble: 4 × 3
## # Groups:   SEX [2]
##     SEX preschoice                       n
##   <dbl> <chr>                        <int>
## 1     1 Donald Trump, the Republican     7
## 2     1 Joe Biden, the Democrat          9
## 3     2 Donald Trump, the Republican     1
## 4     2 Joe Biden, the Democrat         15

Note that any functions of the data are also now organized by that grouping, so if we were to manually compute the proportions using the mutation approach discussed above we would get:

MI_final_small %>%
  filter(AGE10==1) %>%
  group_by(SEX) %>%
  count(preschoice) %>%
  mutate(PctSupport = n/sum(n),
         PctSupport = round(PctSupport, digits=2))
## # A tibble: 4 × 4
## # Groups:   SEX [2]
##     SEX preschoice                       n PctSupport
##   <dbl> <chr>                        <int>      <dbl>
## 1     1 Donald Trump, the Republican     7       0.44
## 2     1 Joe Biden, the Democrat          9       0.56
## 3     2 Donald Trump, the Republican     1       0.06
## 4     2 Joe Biden, the Democrat         15       0.94

So you can see that PctSupport sums to 2.0 because it sums to 1.0 within each value of the grouping variable SEX.

If we wanted the fraction of voters in each unique category – so that the percentages across all categories sum to 1.0 – we would want to ungroup before the mutate that calculates the percentage. So here we perform the functions after the group_by() separately for each value of the grouping variable (here SEX), and then undo the grouping and return to the entire dataset.

MI_final_small %>%
  filter(AGE10==1) %>%
  group_by(SEX) %>%
  count(preschoice) %>%
  ungroup() %>%
  mutate(PctSupport = n/sum(n),
         PctSupport = round(PctSupport, digits=2))
## # A tibble: 4 × 4
##     SEX preschoice                       n PctSupport
##   <dbl> <chr>                        <int>      <dbl>
## 1     1 Donald Trump, the Republican     7       0.22
## 2     1 Joe Biden, the Democrat          9       0.28
## 3     2 Donald Trump, the Republican     1       0.03
## 4     2 Joe Biden, the Democrat         15       0.47

If we are just interested in the proportion and we do not care about the number of respondents in each value (although here it seems relevant!) we could also group_by and then summarize as follows:

MI_final_small %>%
  filter(AGE10==1) %>%
  group_by(SEX) %>%
  summarize(PctBiden = mean(BidenVoter),
          PctTrump = mean(TrumpVoter)) %>%
  mutate(PctBiden = round(PctBiden, digits =2),
         PctTrump = round(PctTrump, digits =2))
## # A tibble: 2 × 3
##     SEX PctBiden PctTrump
##   <dbl>    <dbl>    <dbl>
## 1     1     0.56     0.44
## 2     2     0.94     0.06

Because we have already filtered to focus only on Biden and Trump voters, we don’t actually need both since PctBiden = 1 - PctTrump and PctTrump = 1 - PctBiden.

Note that we can have multiple groups. So if we want to group by age and by sex we can do the following…

MI_final_small %>%
  group_by(SEX, AGE10) %>%
  summarize(PctBiden = mean(BidenVoter)) %>%
  mutate(PctBiden = round(PctBiden, digits =2))
## # A tibble: 22 × 3
## # Groups:   SEX [2]
##      SEX AGE10 PctBiden
##    <dbl> <dbl>    <dbl>
##  1     1     1     0.56
##  2     1     2     0.58
##  3     1     3     0.58
##  4     1     4     0.76
##  5     1     5     0.58
##  6     1     6     0.42
##  7     1     7     0.46
##  8     1     8     0.56
##  9     1     9     0.61
## 10     1    10     0.57
## # ℹ 12 more rows

We can also save it for later analysis and then filter or select the results. For example:

SexAge <- MI_final_small %>%
  group_by(SEX, AGE10) %>%
  summarize(PctBiden = mean(BidenVoter)) %>%
  mutate(PctBiden = round(PctBiden, digits =2)) %>%
  drop_na()

So if we want to look at the Biden support by age among females (i.e., SEX==2) we can look at:

SexAge %>%
  filter(SEX == 2)
## # A tibble: 10 × 3
## # Groups:   SEX [1]
##      SEX AGE10 PctBiden
##    <dbl> <dbl>    <dbl>
##  1     2     1     0.94
##  2     2     2     0.93
##  3     2     3     0.69
##  4     2     4     0.71
##  5     2     5     0.52
##  6     2     6     0.6 
##  7     2     7     0.63
##  8     2     8     0.74
##  9     2     9     0.69
## 10     2    10     0.61

And for Men…
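The analogous code for men – a sketch that simply filters the SexAge tibble saved above on the other value of SEX:

SexAge %>%
  filter(SEX == 1)   # Biden support by age among males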

Discrete Variable By Discrete Variable (Barplot)

If we are working with discrete/categorical/ordinal data – i.e., variables that take on a finite (and small) number of unique values – then we are interested in how to compare across bar graphs.

Before we used geom_bar to plot the number of observations associated with each value of a variable. But we often want to know how the number of observations may vary according to a second variable. For example, we care not only about whether voters reported supporting Biden or Trump in 2020, but also about whether Biden and Trump voters were voting for similar or different reasons. Did voters differ in terms of why they were voting for a candidate, in addition to who they were voting for? If so, this may suggest something about what each set of voters was looking for in a candidate.

Let’s first plot the barplot and then plot the barplot by presidential vote choice for the Michigan Exit Poll we were just analyzing.

We are interested in the distribution of responses to the variable Quality, and we only care about voters who voted for either Biden or Trump (preschoice), so let’s select those variables and filter on preschoice to select those respondents. An additional complication is that the question was only asked of half of the respondents, and some who were asked refused to answer. To remove these respondents we want to drop_na. (Note that this will drop every observation with a missing value – this is acceptable because we have used select to focus on the variables we are analyzing, but if we had not used select it would have dropped any observation with missing data in any variable. We could get around this using drop_na(Quality) if we wanted.) A final complication is that some respondents did not answer the question they were asked, so we have to use filter to remove respondents with missing observations.

Now we include labels – note how we are suppressing the x-label because the value labels are self-explanatory in this instance and add the geom_bar as before.

mi_ep %>% 
    select(Quality,preschoice) %>%
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    drop_na() %>%
    filter(Quality != "[DON'T READ] Don’t know/refused") %>%
    ggplot(aes(x= Quality)) +     
    labs(y = "Number of Voters",
         x = "",
         title = "Michigan 2020 Exit Poll: Reasons for voting for a candidate") +
    geom_bar(color="black") 

Note that if we add coord_flip that we will flip the axes of the graph. (We could also have done this by changing aes(y= Quality), but then we would also have to change the associated labels.)

mi_ep %>% 
    select(Quality,preschoice) %>%
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    drop_na() %>%
    filter(Quality != "[DON'T READ] Don’t know/refused") %>%
    ggplot(aes(x= Quality)) +     
    labs(y = "Number of Voters",
         x = "",
         title = "Michigan 2020 Exit Poll: Reasons for voting for a candidate") +
    geom_bar(color="black") + 
  coord_flip()

So enough review, let’s add another dimension to the data. To show how the self-reported reasons for voting for a presidential candidate varied by vote choice, we are going to use the fill of the graph to create different colored bars depending on the value of the character or factor variable that is used to fill.

So we are going to include as a ggplot aesthetic a character or factor variable as a fill (here fill=preschoice), and then we are also going to include fill in the labs function to make sure that we label the meaning of the values being plotted. The other change is in geom_bar, where we use position="dodge" to make sure that the bars are plotted next to one another rather than on top of one another.

mi_ep %>% 
    select(Quality,preschoice) %>%
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    drop_na() %>%
    filter(Quality != "[DON'T READ] Don’t know/refused") %>%
    ggplot(aes(x= Quality, fill = preschoice)) +     
    labs(y = "Number of Voters",
         x = "",
         title = "Michigan 2020 Exit Poll: Reasons for voting for a candidate",
         fill = "Self-Reported Vote") +
    geom_bar(color="black", position="dodge") +
    coord_flip()

For fun, see what happens when you do not use position="dodge". Also see what happens if you do not flip the coordinates using coord_flip.

It is important to note that the fill variable has to be a character or a factor. If we want to graph self-reported vote by sex, for example, we need to redefine the variable for the purposes of ggplot as follows. Note that because we are not mutating it and are only defining it to be a factor within the ggplot call, this redefinition will not stick. Note also the problem caused by uninformative values in SEX – can you change it?

mi_ep %>% 
    filter(preschoice == "Joe Biden, the Democrat" | preschoice == "Donald Trump, the Republican") %>%
    ggplot(aes(x= preschoice, fill = factor(SEX))) +     
    labs(y = "Number of Respondents",
         x = "",
         title = "Vote by Respondent Sex",
         fill = "Sex") +
    geom_bar(color="black", position="dodge") +
    coord_flip()

Quick Exercise The barplot we just produced does not satisfy our principles of visualization because the fill being used is uninterpretable to those unfamiliar with the dataset. Redo the code to use a fill variable that produces an informative label. Hint: don’t overthink.

# INSERT CODE HERE

Part II: Does Polling Mode Matter?

Suppose we were concerned with whether some polls might give different answers because of variation in who the poll is able to reach using that method. People who take polls via landline phones (do you even know what that is?) might differ from those who take surveys online. Or people contacted using randomly generated phone numbers (RDD) may differ from those contacted from a voter registration list that has had telephone numbers merged onto it.

Polls were done using lots of different methods in 2020.

Loading the data

require(tidyverse)
Pres2020.PV <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/Pres2020_PV.Rds")
Pres2020.PV <- Pres2020.PV %>%
                mutate(Trump = Trump/100,
                      Biden = Biden/100,
                      margin = Biden - Trump)
Pres2020.PV %>%
  count(Mode)
## # A tibble: 9 × 2
##   Mode                 n
##   <chr>            <int>
## 1 IVR                  1
## 2 IVR/Online          47
## 3 Live phone - RBS    13
## 4 Live phone - RDD    51
## 5 Online             366
## 6 Online/Text          1
## 7 Phone - unknown      1
## 8 Phone/Online        19
## 9 <NA>                29

This raises the question: how do we visualize variation in one variable by another variable? More specifically, how can we visualize how the margin we get using one type of survey compares to the margin from another type of poll? (We cannot use a scatterplot because the data come from different observations (here, polls).)

We could do this using earlier methods by selecting polls with a specific interview method (“mode”) and then plotting the margin (or Trump or Biden), but that will produce a bunch of separate plots that may be hard to directly compare. (In addition to having more things to look at we would want to make sure that the scale of the x-axis and y-axis are similar.)

We can plot another “layer” of data in ggplot using the fill parameter. Previously we used it to make the graphs look nice by choosing a particular color. But if we set fill to be a variable in our tibble, then ggplot will plot the data separately for each unique value of the named variable. So if we want to plot the histogram of margin for two types of polls, we can use the fill argument in ggplot to tell R to produce different fills depending on the value of that variable.

Pres2020.PV %>% 
  filter(Mode == "IVR/Online" | Mode == "Live phone - RDD") %>%
    ggplot(aes(x= margin, fill = Mode)) +     
  labs(y = "Number of Polls",
         x = "Biden- Trump Margin",
         title = "Biden-Trump Margin for Two Types of Polls",
        fill = "Mode of Interview") +
    geom_histogram(bins=10, color="black", position="dodge") + 
    scale_x_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1))

Quick Exercise Try running the code without the filter. What do you observe? How useful is this? Why or why not?

# INSERT CODE

While informative, it can be hard to compare the distribution of more than two categories using such methods. To compare the variation across more types of surveys we need a different visualization that summarizes the variation in the variable of interest a bit more. One common visualization is the boxplot, which reports the median, the 25th percentile (i.e., the value of the observation that is 25% of the way through the data when sorted from lowest to highest), the 75th percentile, the range of typical values, and notable outliers.

Let’s see what the boxplot of survey mode looks like after we first drop surveys that were conducted using modes that were hardly used (or missing).

Pres2020.PV %>% 
  filter(Mode != "IVR" & Mode != "Online/Text" & Mode != "Phone - unknown" & !is.na(Mode)) %>%
  ggplot(aes(x = Mode, y = margin)) + 
    labs(x = "Mode of Survey Interview",
         y = "Biden- Trump Margin",
         title = "2020 Popular Vote Margin by Type of Poll") +
    geom_boxplot(fill = "slateblue") +
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1))

We can also flip the graph if we think it makes more sense to display it in a different orientation using coord_flip. (We could, of course, also redefine the x and y variables in the ggplot call, but it is useful to have a command to do this to help you determine which orientation is most useful.)

Pres2020.PV %>% 
  filter(Mode != "IVR" & Mode != "Online/Text" & Mode != "Phone - unknown" & !is.na(Mode)) %>%
  ggplot(aes(x = Mode, y = margin)) + 
    labs(x = "Mode of Survey Interview",
         y = "Biden- Trump Margin",
         title = "2020 Popular Vote Margin by Type of Poll") +
    geom_boxplot(fill = "slateblue") +
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    coord_flip()

A downside of the boxplot is that it can be hard to tell how the data vary within each box. Are they equally spread out? How much data is contained in the whiskers (which extend to the most extreme values within 1.5 times the interquartile range from the box)? To get a better handle on this we can use a “violin” plot, which dispenses with a standard box and instead tries to plot the distribution of the data within each category.

Pres2020.PV %>% 
  filter(Mode != "IVR" & Mode != "Online/Text" & Mode != "Phone - unknown" & !is.na(Mode)) %>%
  ggplot(aes(x=Mode, y=margin)) + 
    xlab("Mode") + 
    ylab("Biden- Trump Margin") +
    geom_violin(fill="slateblue")

It is also hard to know how much data is being plotted. If some modes have 1000 polls and others have only 5 that seems relevant.
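A quick tabulation makes the point – a sketch that counts the polls behind each box/violin using the same filter as above:

Pres2020.PV %>%
  filter(Mode != "IVR" & Mode != "Online/Text" & Mode != "Phone - unknown" & !is.na(Mode)) %>%
  count(Mode)
# From the earlier count table: IVR/Online 47, Live phone - RBS 13,
# Live phone - RDD 51, Online 366, Phone/Online 19

The boxes for Online polls summarize hundreds of observations; the boxes for the phone modes summarize far fewer.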

Quick Exercise We have looked at the difference in margin. How about differences in the percent who report supporting Biden and Trump? What do you observe. Does this suggest that the different ways of contacting respondents may matter in terms of who responds? Is there something else that may explain the differences (i.e., what are we assuming when making this comparison)?

# INSERT CODE HERE

Quick Exercise Some claims have been made that polls that used multiple ways of contacting respondents were better than polls that used just one. Can you evaluate whether there were differences in so-called “mixed-mode” surveys compared to single-mode surveys? (This requires you to define a new variable based on Mode indicating whether survey is mixed-mode or not.)

# INSERT CODE HERE

Continuous Variable By Continuous Variable (Scatterplot)

When we have two continuous variables we use a scatterplot to visualize the relationship. A scatterplot is simply a graph of every point in (x,y) where x is the value associated with the x-variable and y is the value associated with the y-variable. For example, we may want to see how support for Trump and Biden within a poll varies. So each observation is a poll of the national popular vote and we are going to plot the percentage of respondents in each poll supporting Biden against the percentage who support Trump.

To include two variables we are going to change our aesthetic to define both an x variable and a y variable – here aes(x = Biden, y = Trump) and we are going to label and scale the axes appropriately.

Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_point(color="purple") + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1))

The results are intriguing! First, the data seem to fall along a grid. This is because poll results are reported in whole percentage points, and it highlights that even continuous variables may be reported in discrete values. This is consequential because it is hard to know how many polls are associated with each point on the graph. How many polls are at the point (Biden 50%, Trump 45%)? This matters for trying to determine what the relationship might be. Second, it is clear that there are some questions that need to be asked – why doesn’t Biden + Trump = 100%?
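One way to put a number on that second question is to compute the share of each poll not accounted for by the two candidates – a sketch, where the variable name Unaccounted is ours, not part of the data:

Pres2020.PV %>%
  mutate(Unaccounted = 1 - Biden - Trump) %>%   # third parties, undecided, refusals
  summarize(AvgUnaccounted = mean(Unaccounted, na.rm = TRUE))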

To try to display how many observations are located at each point, we have two tools at our disposal. First, we can alter the “alpha transparency” by setting, say, alpha = .3 in the geom_point call. With a low alpha, a point becomes more opaque as more points occur at the same coordinate. Thus a faint point indicates that only a single poll (observation) is located at a coordinate, whereas a solid point indicates that there are many polls. When we apply this to the scatterplot you can immediately see that most of the polls are located in the neighborhood of Biden 50%, Trump 42%.

Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_point(color="purple",alpha = .3) + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1))

However, the grid-like nature of the plot is still somewhat hard to interpret, as it can be hard to discern variations in color gradient. Another tool is to add a tiny bit of randomness to the x and y values associated with each point. Instead of values being constrained to vary by a full percentage point, for example, the jitter allows them to vary by less. To do so we replace geom_point with geom_jitter.

Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_jitter(color="purple",alpha = .5) + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) 

Note how much the visualization changes. Whereas before the eye was focused on – and arguably distracted by – the grid-like orientation imposed by the measurement, once we jitter the points we are immediately made aware of the relationship between the two variables. While we are indeed slightly changing our data by adding random noise, the payoff is that the visualization arguably better highlights the nature of the relationship. Insofar as the goal of visualization is communication, this trade-off seems worthwhile in this instance. But here again data science is sometimes as much art as science. The decision of which visualization to use depends on what you think most effectively communicates the nature of the relationship to the reader.

We can also look at the accuracy of a poll as a function of the sample size. This is also a relationship between two continuous variables – hence a scatterplot! Are polls with more respondents more accurate? There is one poll with nearly 80,000 respondents that we will filter out to be able to show a reasonable scale. Note that we are going to use labels = scales::comma when plotting the x-axis to report numbers with commas for readability.

Pres2020.PV %>%
  filter(SampleSize < 50000) %>%
  mutate(TrumpError = Trump - RepCertVote/100,
         BidenError = Biden - DemCertVote/100) %>%
  ggplot(aes(x = SampleSize, y = TrumpError)) + 
  labs(title="Trump Polling Error in 2020 National Popular Vote as a function of Sample Size",
       y = "Error: Trump Poll - Trump Certified Vote",
       x = "Sample Size in Poll") + 
  geom_jitter(color="purple",alpha = .5) +
  scale_y_continuous(breaks=seq(-.2,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,30000,by=5000),
                     labels= scales::comma) 

In sum, we have tested Trump’s theory that the MSM was biased against him. We found that polls that underpredicted Trump also underpredicted Biden. This is not what we would expect if the polls favored one candidate over another.

Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_jitter(color="purple",alpha = .5) + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) 

What is an alternative explanation for these patterns? Why would polls underpredict both Trump and Biden?

Perhaps they were fielded earlier in the year, when more people were interested in third party candidates, or hadn’t made up their mind. We’ll turn to testing this theory next time!

Part III: Why did polls underpredict both Trump and Biden?

Recap

We just tested Trump’s theory that the MSM was biased against him. We found that polls that underpredicted Trump also underpredicted Biden. This is not what we would expect if the polls favored one candidate over another.

Loading the data

require(tidyverse)
require(scales)
Pres2020.PV <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/Pres2020_PV.Rds")
Pres2020.PV <- Pres2020.PV %>%
                mutate(Trump = Trump/100,
                      Biden = Biden/100,
                      margin = Biden - Trump)
Pres2020.PV %>%
  ggplot(aes(x = Biden, y = Trump)) + 
  labs(title="Biden and Trump Support in 2020 National Popular Vote",
       y = "Trump Support",
       x = "Biden Support") + 
  geom_jitter(color="purple",alpha = .5) + 
    scale_y_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks=seq(0,1,by=.05),
                     labels= scales::percent_format(accuracy = 1)) 

What is an alternative explanation for these patterns? Why would polls underpredict both Trump and Biden?

Perhaps they were fielded earlier in the year, when more people were interested in third party candidates, or hadn’t made up their mind.

Visualizing More Dimensions – And Introducing Dates!

How did the support for Biden and Trump vary across the course of the 2020 Election?

Give you some tools to do some amazing things!

Telling Time

Dates in R

election.day <- as.Date("11/3/2020", "%m/%d/%Y")   
election.day16 <- as.Date("11/8/2016", "%m/%d/%Y")   
election.day - election.day16
## Time difference of 1456 days
as.numeric(election.day - election.day16)
## [1] 1456

Initial Questions

So, for every day, how many polls were reported by the media?

Note that we could also look to see if the poll results depend on how the poll is being done (i.e., the Mode used to contact respondents) or even depending on who funded (Funded) or conducted (Conducted) the polls. We could also see if larger polls (i.e., polls with more respondents SampleSize) or polls that took longer to conduct (DaysInField) were more or less accurate.

Let’s Wrangle…

Pres2020.PV <- Pres2020.PV %>%
                mutate(EndDate = as.Date(EndDate, "%m/%d/%Y"), 
                      StartDate = as.Date(StartDate, "%m/%d/%Y"),
                      DaysToED = as.numeric(election.day - EndDate),
                      Trump = Trump/100,
                      Biden = Biden/100,
                      margin = Biden - Trump)

What are we plotting?

ggplot(data = Pres2020.PV, aes(x = DaysToED)) +
  labs(title = "Number of 2020 National Polls Over Time",
       x = "Number of Days until Election Day",
       y = "Number of Polls") + 
  geom_bar(fill="purple", color= "black") +
  scale_x_continuous(breaks=seq(0,230,by=10)) +
  scale_y_continuous(labels = label_number(accuracy = 1))

So this is a bit weird because the axis runs from smallest to largest even though that is reverse chronological order. Since November comes after January, it may make sense to flip the scale so that polls closer to Election Day appear further to the right, as the reader moves along the scale.

ggplot(data = Pres2020.PV, aes(x = DaysToED)) +
  labs(title = "Number of 2020 National Polls Over Time") + 
  labs(x = "Number of Days until Election Day") + 
  labs(y = "Number of Polls") + 
  geom_bar(fill="purple", color= "black") +
  scale_x_reverse(breaks=seq(0,230,by=10)) +
  scale_y_continuous(labels = label_number(accuracy = 1))

But is the bar graph the right plot to use? Should we plot every single day given that we know some days might contain fewer polls (e.g., weekends)? What if we use a histogram instead, to plot the number of polls that occur in a set interval of time (defined by the number of bins chosen)?

ggplot(data = Pres2020.PV, aes(x = DaysToED)) +
  labs(title = "Number of 2020 National Polls Over Time") + 
  labs(x = "Number of Days until Election Day") + 
  labs(y = "Number of Polls") + 
  geom_histogram(color="PURPLE",bins = 30) +
  scale_x_reverse()

Which do you prefer? Why or why not? Data Science is sometimes as much art as it is science - especially when it comes to data visualization!

Bivariate/Multivariate relationships

A very frequently used plot is the scatterplot that shows the relationship between two variables. To do so we are going to define an aesthetic when calling ggplot that defines both an x-variable (here EndDate) and a y-variable (here margin).

First, a brief aside: we can change the size of the figure printed in our R Markdown by including some options when defining the R chunk. Much as we could suppress messages (message=FALSE), tell R not to evaluate the chunk (eval=FALSE), or run the code but not print it (echo=FALSE), we can add arguments to the chunk header. For example, the options below define the figure to be 2 inches tall, 2 inches wide, and aligned in the center (as opposed to left-justified). As you can see, depending on the choices you make you can destroy the readability of the graphics, as ggplot will attempt to rescale the figure accordingly. The dimensions are in inches and refer to the plotting area – an area that includes labels and margins, so it is not just the area where the data themselves appear.
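A chunk header with those options would look something like the sketch below (standard knitr chunk options; the chunk name is our own):

```{r tiny-figure, fig.height=2, fig.width=2, fig.align="center"}
margin_over_time_plot
```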

margin_over_time_plot <- Pres2020.PV %>%
  ggplot(aes(x = EndDate, y = margin)) + 
  labs(title="Margin in 2020 National Popular Vote Polls Over Time",
       y = "Margin: Biden - Trump",
       x = "Poll Ending Date") + 
  geom_point(color="purple")
margin_over_time_plot

To “fix” this we can call the ggplot object without defining the graphical parameters.

margin_over_time_plot

Ok, back to fixing the graph. What do you think?

  1. The axes look weird – lots of interpolation is required by the consumer.

  2. Data looks “chunky”? How many data points are at each point?

To fix the axis scale we can use scale_y_continuous to choose better labels for the y-axis, and we can use scale_x_date to determine how to plot the dates. Here we are going to plot at two-week intervals (date_breaks = "2 week") using labels that include the month and date (date_labels = "%b %d").

Note here that we are going to adjust the plot by adding new features to the ggplot object margin_over_time_plot. We could also have created the graph without creating the object.

margin_over_time_plot <- margin_over_time_plot  + 
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    scale_x_date(date_breaks = "2 week", date_labels = "%b %d") 
margin_over_time_plot

Now one thing that is hard to know is how many polls are at a particular point. If some points contain a single poll and others contain 1,000 polls, that matters a lot for how we interpret the relationship.

To help convey this information we can use geom_jitter with alpha transparency instead of geom_point. Here we are adding up to +/- .005 of random noise to the y-value of each poll, but not to the date. (Note that we could also use position = "jitter" when calling geom_point.) This adds just enough error to the values associated with each point to help distinguish how many observations share a value. We are also going to use alpha transparency so that regions where many (jittered) points overlap appear darker. Note that in this code we are going to redo the plot from start to finish.

Pres2020.PV %>%
  ggplot(aes(x = EndDate, y = margin)) + 
  labs(title="Margin in 2020 National Popular Vote Polls Over Time",
       y = "Margin: Biden - Trump",
       x = "Poll Ending Date") + 
    geom_jitter(color = "purple", height=.005, alpha = .4) +
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    scale_x_date(date_breaks = "2 week", date_labels = "%b %d") 

In addition to plotting points using geom_point we can also add lines to the plot using geom_line. The line connects the values in sequence, so it does not always make sense to include. For example, if we add geom_line to the plot it is hard to make the case that the resulting lines are meaningful.

Pres2020.PV %>%
  ggplot(aes(x = EndDate, y = margin)) + 
  labs(title="Margin in 2020 National Popular Vote Polls Over Time",
       y = "Margin: Biden - Trump",
       x = "Poll Ending Date") + 
    geom_jitter(color="purple", alpha = .5) + 
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    scale_x_date(date_breaks = "2 week", date_labels = "%b %d") +
  geom_line()

While geom_line may help accentuate variation over time, it is really not designed to summarize a relationship in the data, as it simply connects sequential points (arranged according to the x-axis). In contrast, if we add geom_smooth then the plot will include a local average of nearby points to help summarize the trend. (Note that we have opted to include the associated uncertainty in the moving average by setting the se=T parameter in geom_smooth.)

# Using message=FALSE to suppress note about geom_smooth
Pres2020.PV %>%
  ggplot(aes(x = EndDate, y = margin)) + 
  labs(title="Margin in 2020 National Popular Vote Polls Over Time",
       y = "Margin: Biden - Trump",
       x = "Poll Ending Date") + 
    geom_jitter(color="purple", alpha = .5) + 
    scale_y_continuous(breaks=seq(-.1,.2,by=.05),
                     labels= scales::percent_format(accuracy = 1)) +
    scale_x_date(date_breaks = "2 week", date_labels = "%b %d") +
    geom_smooth(color = "BLACK", se=T) 

Plotting Multiple Variables Over Time (Time-Series)

Let’s define a ggplot object to add to later. Note that this code chunk will not print anything because we need to call the object to see what it produced.

BidenTrumpplot <- Pres2020.PV %>%
  ggplot()  +
  geom_point(aes(x = EndDate, y = Trump), 
             color = "red", alpha=.4)  +
  geom_point(aes(x = EndDate, y = Biden), 
             color = "blue", alpha=.4) +
  labs(title="% Biden and Trump in 2020 National Popular Vote Polls Over Time",
       y = "Pct. Support",
       x = "Poll Ending Date") +
  scale_x_date(date_breaks = "2 week", date_labels = "%b %d") + 
  scale_y_continuous(breaks=seq(.3,.7,by=.05),
                     labels= scales::percent_format(accuracy = 1)) 

Now call the plot

BidenTrumpplot

Now we are going to add smoothed lines to the ggplot object. The nice thing about having saved the ggplot object above is that we can simply add the smoothed lines to the existing plot. geom_smooth summarizes the data using a subset of it (the default is 75%): after sorting the polls by their date, it generates the predicted value for a poll taken at a given date using the closest 75% of polls, as measured by date. (It is not taking a simple mean of those values, but it is doing something similar.) After doing so for the first date, it does the same for the second date, where the “closest” data now include polls that happened both before and after. The smoother “moves along” the x-axis and generates a predicted value for each date. Because each prediction uses so much of the data, the line changes very slowly: the only difference between two adjacent dates is that a little information is dropped and a little is added.

When calling the geom_smooth we used se=T to tell ggplot to produce an estimate of how much the prediction may vary. (This is the 95% confidence interval for the prediction being graphed.)

# Using message=FALSE to suppress note about geom_smooth
BidenTrumpplot +  
  geom_smooth(aes(x = EndDate, y = Trump), 
              color = "red",se=T) + 
  geom_smooth(aes(x = EndDate, y = Biden), 
              color = "blue",se=T)

If we change it to use only the closest 10% of the data by setting span=.1 we get a slightly different relationship. The smoothed lines are less smooth because the impact of adding and removing points is much greater when we use less data to compute each smoothed value: a single observation can change the result much more than when we used most of the data. In addition, the error bands around the smoother are wider because we are using less data to calculate each point.

# Using message=FALSE to suppress note about geom_smooth
BidenTrumpplot +  
  geom_smooth(aes(x = EndDate, y = Trump), 
              color = "red",se=T,span=.1) + 
  geom_smooth(aes(x = EndDate, y = Biden), 
              color = "blue",se=T,span=.1)

Try some other values to see how things change. Note that you are not changing the data; you are only changing how you visualize the relationship over time by deciding how much data to use. In part, the decision of how much data to use is a question of how much you want to allow public opinion to vary over the course of the campaign: how much of the variation is “real” versus how much is “noise”?

Quick Exercise Choose another span and see how it changes how you interpret how much variation there is in the data over time.

# INSERT CODE
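One possible answer, sketched with span = .3 (any value between 0 and 1 works; smaller values make the line wigglier):

```r
# A sketch: span = .3 uses the nearest 30% of polls for each smoothed value
BidenTrumpplot +
  geom_smooth(aes(x = EndDate, y = Trump),
              color = "red", se = TRUE, span = .3) +
  geom_smooth(aes(x = EndDate, y = Biden),
              color = "blue", se = TRUE, span = .3)
```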

ADVANCED! We can also use color or fill to introduce another dimension into the visualization. Consider, for example, plotting support for Trump over time for mixed-mode polls versus telephone polls versus online-only polls. How would you go about doing this? You want to be careful not to make a mess, however. Just because you can doesn’t mean you should!
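One way to sketch this, assuming Pres2020.PV contains a Mode column recording how each poll was conducted (the state-level data loaded below has one, so this is a plausible but unverified assumption for the national data):

```r
# A sketch: map Mode to color so each polling mode gets its own points
# and its own smoother
Pres2020.PV %>%
  ggplot(aes(x = EndDate, y = Trump, color = Mode)) +
  geom_jitter(alpha = .4) +
  geom_smooth(se = FALSE) +
  labs(title = "% Trump in 2020 National Popular Vote Polls by Mode",
       y = "Pct. Support",
       x = "Poll Ending Date")
```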

State Polls and the Electoral College

New functions and libraries:

First load the data of all state-level polls for the 2020 presidential election.

Pres2020.StatePolls <- read_rds(file="https://github.com/rweldzius/PSC7000_F2024/raw/main/Data/Pres2020_StatePolls.Rds")
glimpse(Pres2020.StatePolls)
## Rows: 1,545
## Columns: 19
## $ StartDate      <date> 2020-03-21, 2020-03-24, 2020-03-24, 2020-03-28, 2020-0…
## $ EndDate        <date> 2020-03-30, 2020-04-03, 2020-03-29, 2020-03-29, 2020-0…
## $ DaysinField    <dbl> 10, 11, 6, 2, 3, 5, 2, 2, 7, 3, 3, 3, 2, 2, 3, 4, 10, 1…
## $ MoE            <dbl> 2.8, 3.0, 4.2, NA, 4.0, 1.7, 3.0, 3.1, 4.1, 4.4, NA, NA…
## $ Mode           <chr> "Phone/Online", "Phone/Online", "Live phone - RDD", "Li…
## $ SampleSize     <dbl> 1331, 1000, 813, 962, 602, 3244, 1035, 1019, 583, 500, …
## $ Biden          <dbl> 41, 47, 48, 67, 46, 46, 46, 48, 52, 42, 48, 50, 52, 38,…
## $ Trump          <dbl> 46, 34, 45, 29, 46, 40, 48, 45, 39, 49, 47, 41, 43, 49,…
## $ Winner         <chr> "Rep", "Dem", "Dem", "Dem", "Dem", "Rep", "Dem", "Dem",…
## $ poll.predicted <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Funded         <chr> "UtahPolicy.com & KUTV 2News", "Sacred Heart University…
## $ Conducted      <chr> "Y2 Analytics", "GreatBlue Research", "LHK Partners Inc…
## $ margin         <dbl> -5, 13, 3, 38, 0, 6, -2, 3, 13, -7, 1, 9, 9, -11, 6, -1…
## $ DaysToED       <drtn> 218 days, 214 days, 219 days, 219 days, 216 days, 213 …
## $ StateName      <chr> "Utah", "Connecticut", "Wisconsin", "California", "Mich…
## $ EV             <int> 6, 7, 10, 55, 16, 29, 16, 16, 12, 15, 10, 16, 11, 6, 16…
## $ State          <chr> "UT", "CT", "WI", "CA", "MI", "FL", "GA", "MI", "WA", "…
## $ BidenCertVote  <dbl> 38, 59, 49, 64, 51, 48, 50, 51, 58, 49, 49, 51, 49, 41,…
## $ TrumpCertVote  <dbl> 58, 39, 49, 34, 48, 51, 49, 48, 39, 50, 49, 48, 49, 58,…

Variables of potential interest include:

Overall Task:

Task 1: How should we translate a poll result into a predicted probability?

Suppose that I give you 10 polls from a state.

The data are already loaded, so now use mutate to create some new variables.

Pres2020.StatePolls <- Pres2020.StatePolls %>%
   mutate(BidenNorm = Biden/(Biden+Trump),
          TrumpNorm = 1-BidenNorm,
          Biden = Biden/100,
          Trump=Trump/100)

How can you use them to create a probability? Discuss! (I can think of 3 ways.)

  • Measure 1: Fraction of polls with Biden in the lead
  • Measure 2: Biden Pct = Probability Biden wins
  • Measure 3: Normalized Biden Pct = Probability Biden wins (i.e., all voters either vote for Biden or Trump). Sometimes called “two-party” vote.

How does this vary across states? The joys of group_by(). Note that group_by() changes how all subsequent code in the pipeline behaves: each operation is run separately for every group. So here we are going to calculate the means separately for each state.

stateprobs <- Pres2020.StatePolls %>%
    group_by(StateName) %>%
      summarize(BidenProbWin1 = mean(Biden > Trump),
                BidenProbWin2 = mean(Biden),  
                BidenProbWin3 = mean(BidenNorm))

stateprobs
## # A tibble: 50 × 4
##    StateName   BidenProbWin1 BidenProbWin2 BidenProbWin3
##    <chr>               <dbl>         <dbl>         <dbl>
##  1 Alabama             0             0.389         0.407
##  2 Alaska              0             0.442         0.466
##  3 Arizona             0.840         0.484         0.519
##  4 Arkansas            0             0.381         0.395
##  5 California          1             0.618         0.661
##  6 Colorado            1             0.534         0.571
##  7 Connecticut         1             0.584         0.631
##  8 Delaware            1             0.603         0.627
##  9 Florida             0.798         0.486         0.517
## 10 Georgia             0.548         0.474         0.504
## # ℹ 40 more rows

Clearly they differ, so let’s visualize to try to understand what is going on. Install the plotly library (install.packages("plotly")) if you have not already, then load it.

library(plotly)
gg <- stateprobs %>%
  ggplot(aes(x=BidenProbWin2, y=BidenProbWin3,text=paste(StateName))) +
  geom_point() +
  geom_abline(intercept=0,slope=1) +
  labs(x= "Probability as % Support",
       y = "Probability as Two-Party % Support",
       title = "Comparing Probability of Winning Measures")

ggplotly(gg,tooltip = "text")

So removing the undecideds and making the percentages for Biden and Trump sum to 100% is consequential.

What if we compare these measures to the fraction of polls with a given winner? After all, it seems implausible that Biden would ever lose California or that Trump would ever lose Tennessee.

library(plotly)
gg <- stateprobs %>%
  ggplot(aes(x=BidenProbWin2, y=BidenProbWin1,text=paste(StateName))) +
  geom_point() +
  geom_abline(intercept=0,slope=1) +
  labs(x= "Probability as % Support",
       y = "Probability as % Polls Winning",
       title = "Comparing Probability of Winning Measures")

ggplotly(gg,tooltip = "text")

So what do you think? Exactly the same data, but different implications depending on how you choose to measure the probability of winning a state. Data science is as much about argument and reasoning as it is about coding. How we measure a concept is often critical to the conclusions we reach.

Task 2: Start “simple” – calculate the probability that Biden wins PA

But we want to combine these probabilities with the Electoral College votes in each state. Not every state has the same number of Electoral College votes: a state’s total is typically the number of its Senators (2) plus the number of its Representatives (at least 1), so Pennsylvania, for example, had 20 votes in 2020. We need to account for this if we want to make a projection about who is going to win the Electoral College.

  • Create a tibble with just polls from PA.
PA.dat <- Pres2020.StatePolls %>% 
  filter(State == "PA")
  • Now compute these three probabilities. What functions do we need?
PA.dat %>%
      summarize(BidenProbWin1 = mean(Biden > Trump),
                BidenProbWin2 = mean(Biden),  
                BidenProbWin3 = mean(BidenNorm))
## # A tibble: 1 × 3
##   BidenProbWin1 BidenProbWin2 BidenProbWin3
##           <dbl>         <dbl>         <dbl>
## 1         0.916         0.499         0.529
  • What do you think about this?

Task 3: Given that probability, how do we change the code to compute the expected number of Electoral College Votes EV for Biden?

  • Keep the code from above, copying and pasting it so you can see how each step changes what we are doing. Note that we have the number of Electoral College votes associated with each state (EV) that we want to use to compute the expected number of Electoral College votes. But recall that when we summarize, the tibble becomes the output of the function. So how do we keep the number of Electoral College votes for a future mutation?
PA.dat %>%
      summarize(BidenProbWin1 = mean(Biden > Trump),
                BidenProbWin2 = mean(Biden),
                BidenProbWin3 = mean(BidenNorm),
                EV = mean(EV)) %>%
      mutate(BidenEV1 = BidenProbWin1*EV,
             BidenEV2 = BidenProbWin2*EV,
             BidenEV3 = BidenProbWin3*EV)
## # A tibble: 1 × 7
##   BidenProbWin1 BidenProbWin2 BidenProbWin3    EV BidenEV1 BidenEV2 BidenEV3
##           <dbl>         <dbl>         <dbl> <dbl>    <dbl>    <dbl>    <dbl>
## 1         0.916         0.499         0.529    20     18.3     9.98     10.6

Note that we are calculating the expected value of the Electoral College votes as: (probability that Biden wins state i) X (Electoral College votes in state i). For Pennsylvania, measure 1 gives .916 x 20 = 18.3 expected votes. This allocates fractions of Electoral College votes even though the actual election is winner-take-all in nearly every state. That is OK because the fractions reflect the probability that an alternative outcome occurs.

Quick Exercise How can we compute the expected number of Electoral College votes for Trump under each measure? NOTE: There are at least 2 ways to do this because this is a 2-candidate race.

# INSERT CODE HERE
  • EV-BidenEV, or compute TrumpProbWin
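Both approaches, sketched for the PA polls (the two differ only if a poll is an exact tie):

```r
# Way 1: Trump's expected EV is whatever Biden's expected EV leaves behind
# Way 2: compute Trump's probability of winning directly
PA.dat %>%
  summarize(BidenProbWin1 = mean(Biden > Trump),
            TrumpProbWin1 = mean(Trump > Biden),
            EV = mean(EV)) %>%
  mutate(TrumpEV1a = EV - BidenProbWin1*EV,  # Way 1
         TrumpEV1b = TrumpProbWin1*EV)       # Way 2
```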

Task 4: Now generalize to every state by applying this code to each set of state polls.

  • What do we need to do this calculation for every state in our tibble?
  • First, compute probability of winning a state. (How?)
  • Second, compute expected Electoral College Votes. (How?)
Pres2020.StatePolls %>%  
  group_by(StateName) %>%
    summarize(BidenProbWin1 = mean(Biden > Trump),
              BidenProbWin3 = mean(BidenNorm),
              EV = mean(EV),
              State = first(State)) %>%
    mutate(BidenECVPredicted1 = EV*BidenProbWin1,
           TrumpECVPredicted1 = EV - BidenECVPredicted1,
           BidenECVPredicted3 = EV*BidenProbWin3,
           TrumpECVPredicted3 = EV - BidenECVPredicted3) %>%
  summarize(BidenECVPredicted1=sum(BidenECVPredicted1),
            BidenECVPredicted3=sum(BidenECVPredicted3),
            TrumpECVPredicted1=sum(TrumpECVPredicted1),
            TrumpECVPredicted3=sum(TrumpECVPredicted3))
## # A tibble: 1 × 4
##   BidenECVPredicted1 BidenECVPredicted3 TrumpECVPredicted1 TrumpECVPredicted3
##                <dbl>              <dbl>              <dbl>              <dbl>
## 1               345.               289.               190.               246.

Task 5: Now compute total expected vote by adding to that code

  • NOTE: The actual result was Biden 306 - Trump 232.
  • What do we need to do to the tibble we created in Task 4 to get the overall number of Electoral College Votes?
Pres2020.StatePolls %>%  
  group_by(StateName) %>%
    summarize(BidenProbWin1 = mean(Biden > Trump),
              BidenProbWin3 = mean(BidenNorm),
              EV = mean(EV)) %>%
    mutate(BidenECVPredicted1 = EV*BidenProbWin1,
              TrumpECVPredicted1 = EV- BidenECVPredicted1,
              BidenECVPredicted3 = EV*BidenProbWin3,
              TrumpECVPredicted3 = EV- BidenECVPredicted3) %>%
    summarize(BidenECV1 = sum(BidenECVPredicted1),
              TrumpECV1 = sum(TrumpECVPredicted1),
              BidenECV3 = sum(BidenECVPredicted3),
              TrumpECV3 = sum(TrumpECVPredicted3))
## # A tibble: 1 × 4
##   BidenECV1 TrumpECV1 BidenECV3 TrumpECV3
##       <dbl>     <dbl>     <dbl>     <dbl>
## 1      345.      190.      289.      246.

Quick Exercise Could also do this for just polls conducted in the last 7 days. How?

# INSERT CODE HERE
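One possible answer, sketched for measure 1 (DaysToED is stored as a difftime, which R will compare against a plain number):

```r
# A sketch: restrict to polls ending within 7 days of Election Day,
# then repeat the expected-EV calculation
Pres2020.StatePolls %>%
  filter(DaysToED <= 7) %>%
  group_by(StateName) %>%
  summarize(BidenProbWin1 = mean(Biden > Trump),
            EV = mean(EV)) %>%
  mutate(BidenECVPredicted1 = EV*BidenProbWin1) %>%
  summarize(BidenECV1 = sum(BidenECVPredicted1))
```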

THINKING: What about states that do not have any polls? What should we do about them? Is there a reason why they might not have a poll? Is that useful information? Questions like this become more relevant when we start to restrict the sample.
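One way to see which states drop out of a restricted sample (a sketch; state.name is R’s built-in vector of the 50 state names, so DC is not checked here):

```r
# States with no polls at all in the restricted window
recent <- Pres2020.StatePolls %>%
  filter(DaysToED < 3) %>%
  distinct(StateName)
setdiff(state.name, recent$StateName)
```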

Here are the number of polls done in each state in the last 3 days. Note that when we use fewer days, our measure based on the percentage of polls won may be more sensitive to any single poll.

Pres2020.StatePolls %>%
  filter(DaysToED < 3) %>%
  count(State) %>%
  ggplot(aes(x=n)) +
  geom_bar() + 
  scale_x_continuous(breaks=seq(0,15,by=1)) +
  labs(x="Number of Polls in a State",
       y="Number of States",
       title="Number of Polls in States \n in the Last 3 Days of 2020")